Workspace setup:
library(tidyverse)
library(cowplot)
library(rethinking)
library(patchwork)
Last week, we started to build out multiple regression models, those that include both categorical and continuous variables as “main effects” or predictors in a model. These are simple multiple regression models, and they can be extremely useful for things like revealing spurious correlations – zero-order correlations that suggest association even when the two variables are not causally related – and important correlations that are masked by other variables.
However, you cannot interpret the coefficients in any multiple regression model without identifying the underlying causal model. This week, we’ll use Directed Acyclic Graphs (DAGs) to develop and visualize our causal models. These DAGs will then help us determine which variables, if any, to control for when trying to estimate causal pathways. Along the way, we’ll discuss some common mistakes when it comes to controls and their disasterious consequences.
This will also be a good opportunity to practice the mathematical models and code we’ve discussed before, but there will be little new code this week.
library(dagitty)
dag3.1 <- dagitty( "dag{ Z -> X; Z -> Y }" )
coordinates(dag3.1) <- list( x=c(X=-1,Z=0,Y=1) , y=c(X=0,Z=-1,Y=0) )
drawdag( dag3.1, cex = 3, lwd = 3 )
True confounds.
Not stratifying on (controlling for) Z will yield a spurious relationship between X and Y. That is, a correlation of X and Y (or regression) will be non-zero, even though there is no causal relationship from one to another.
data(WaffleDivorce, package = "rethinking")
d <- WaffleDivorce
Create two plots, one showing the relationship between marriage rate and divorce rate and another showing the relationship between median age at marriage and divorce rate.
p1 <- d %>% ggplot(aes(x = Marriage, y = Divorce)) +
geom_point(color = "#1c5253") +
geom_smooth(method = "lm", color = "black") +
labs(x = "marriage rate", y = "divorce rate")
p2 <- d %>% ggplot(aes(x = MedianAgeMarriage, y = Divorce)) +
geom_point(color = "#1c5253") +
geom_smooth(method = "lm", color = "black") +
labs(x = "median age at marriage", y = "divorce rate")
(p1 | p2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Model the relationship between Divorce Rate (D) and Marriage Rate (M). (Standardize both first.) Be sure to do the following:
Write a mathematical model expressing this relationship including priors.
Sample from your priors to evaluate them.
Calculate your posterior predictions for the relationship between D and M.
\[\begin{align*} D_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta_MM_i \\ \alpha &\sim \text{Normal}(0, 0.2) \\ \beta_M &\sim \text{Normal}(0, 0.5) \\ \sigma &\sim \text{Exponential}(1) \\ \end{align*}\]
d$D <- standardize(d$Divorce)
d$M <- standardize(d$Marriage)
flist <- alist(
D ~ dnorm( mu, sigma),
mu <- a + bM * M,
a ~ dnorm(0, 0.2),
bM ~ dnorm(0, 0.5),
sigma ~ dexp(1)
)
m3.1 <- quap(flist = flist, data = d)
priors <- extract.prior(m3.1)
mu <- link( m3.1 , post=priors , data=list( M=c(-2,3) ) )
plot( D ~ M , data=d, xlab = "marriage rate", ylab = "divorce rate" )
for ( i in 1:50 ) lines( c(-2,3) , mu[i,] , col=col.alpha("#1c5253",0.4) )
precis(m3.1)
## mean sd 5.5% 94.5%
## a -6.170141e-07 0.10824650 -0.1729994 0.1729982
## bM 3.500549e-01 0.12592756 0.1487984 0.5513115
## sigma 9.102663e-01 0.08986263 0.7666484 1.0538841
Bonus: a plot
# compute percentile interval of mean
M_seq <- seq( from=-3 , to=3.2 , length.out=30 )
mu <- link( m3.1 , data=list(M=M_seq) )
mu.mean <- apply( mu , 2, mean )
mu.PI <- apply( mu , 2 , PI )
# plot it all
plot( D ~ M , data=d , col= "#1c5253", xlab = "marriage rate", ylab = "divorce rate" )
lines( M_seq , mu.mean , lwd=2 )
shade( mu.PI , M_seq )
Now we’re going to incorporate state age (median age at marriage) into our model. This is the DAG proposed by RM. What does this DAG represent?
dag3.2 <- dagitty( "dag{ A -> D; A -> M; M -> D }" )
coordinates(dag3.2) <- list( x=c(A=0,D=1,M=2) , y=c(A=0,D=1,M=0) )
drawdag( dag3.2, cex = 3, lwd = 3 )
DAG models help us to see conditional independencies.
impliedConditionalIndependencies( dag3.2 ) #none
implication is that if we find any of these three variables are uncorrelated, our DAG is wrong.
How does this change with a new DAG?
dag3.3 <- dagitty( "dag{ A -> D; A -> M }" )
coordinates(dag3.3) <- list( x=c(A=0,D=1,M=2) , y=c(A=0,D=1,M=0) )
drawdag( dag3.3, cex = 3, lwd = 3 )
impliedConditionalIndependencies( dag3.3 )
## D _||_ M | A
implication that D and M will be independent after stratifying on A. We can test this.
\[\begin{align*} D_i &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= \alpha + \beta_AA_i + \beta_MM_i\\ \alpha &\sim \text{Normal}(0, 0.2) \\ \beta_A &\sim \text{Normal}(0, 0.5) \\ \beta_M &\sim \text{Normal}(0, 0.5) \\ \sigma &\sim \text{Exponential}(1) \\ \end{align*}\]
d$A <- standardize(d$MedianAgeMarriage)
flist <- alist(
D ~ dnorm( mu, sigma),
mu <- a + bA * A + bM * M,
a ~ dnorm(0, 0.2),
bA ~ dnorm(0, 0.5),
bM ~ dnorm(0, 0.5),
sigma ~ dexp(1)
)
m3.2 <- quap(flist = flist, data = d)
precis(m3.2)
## mean sd 5.5% 94.5%
## a 5.769481e-05 0.09708009 -0.1550950 0.1552104
## bA -6.136459e-01 0.15098900 -0.8549554 -0.3723363
## bM -6.547386e-02 0.15077934 -0.3064484 0.1755007
## sigma 7.851610e-01 0.07785390 0.6607354 0.9095866
plot( coeftab(m3.1,m3.2), par=c("bA","bM") )
If we want to simulate the effect of manipulating marriage, we use “do calculus.” We do this by effectively “deleting” the arrows going into our manipulation variable (M).
post <- extract.samples(m3.2) # get 10k samples of all parameters
n <- 1e4
As <- sample(d$A, size=n, replace=T) #sample from original data
# simulate D for M=0
DM0 <- with( post ,
rnorm(n, a + bM*0 + bA*As, sigma))
# simulate D for M=1 -- SAME A values
DM1 <- with( post ,
rnorm(n, a + bM*1 + bA*As, sigma))
#contrast
M10_con <- DM1 - DM0
dens(M10_con, lwd=4, col = "#1c5253", xlab="effect of 1SD increase in M")
A pipe in a DAG model represents a situation where a variable acts as a mediator between two other variables. In this context, the effect of one variable on another is transmitted through the mediator.
dag_pipe <- dagitty("dag{ X -> M -> Y }")
coordinates(dag_pipe) <- list(x=c(X=0, M=1, Y=2), y=c(X=0, M=1, Y=0))
drawdag(dag_pipe, cex = 3, lwd = 3)
One place that pipes show up is in post-treatment bias.
Suppose you are studying plants in a greenhouse, and you want to know how effective a particular fungal treatment is. Fungus on plants tends to reduce their growth. You plant a bunch of plants, measure them, then apply one of two different treatments. After some time, you measure the plants again, and you measure the amount of fungus on the plants.
Simulate some fake plant data.
set.seed(71)
# number of plants
N <- 100
# simulate initial heights
h0 <- rnorm(N,10,2)
# assign treatments and simulate fungus and growth
treatment <- rep( 0:1 , each=N/2 )
fungus <- rbinom( N , size=1 , prob=0.5 - treatment*0.4 )
h1 <- h0 + rnorm(N, 5 - 3*fungus)
# compose a clean data frame
d <- data.frame( h0=h0 , h1=h1 , treatment=treatment , fungus=fungus )
precis(d)
## mean sd 5.5% 94.5% histogram
## h0 9.95978 2.1011623 6.570328 13.07874 ▁▂▂▂▇▃▂▃▁▁▁▁
## h1 14.39920 2.6880870 10.618002 17.93369 ▁▁▃▇▇▇▁▁
## treatment 0.50000 0.5025189 0.000000 1.00000 ▇▁▁▁▁▁▁▁▁▇
## fungus 0.23000 0.4229526 0.000000 1.00000 ▇▁▁▁▁▁▁▁▁▂
Draw the dag that describes the relationships between these 4 variables.
What are the implied conditional independences?
plant_dag <- dagitty( "dag {
H_0 -> H_1
F -> H_1
T -> F}" )
coordinates( plant_dag ) <- list( x=c(H_0=1.0,T=0,F=0.5,H_1=1),
y=c(H_0=-.5,T=0,F=0.5,H_1=0) )
drawdag( plant_dag, cex = 3, lwd = 3 )
impliedConditionalIndependencies(plant_dag)
## F _||_ H_0
## H_0 _||_ T
## H_1 _||_ T | F
Let’s start by just modeling growth using our two height variables.
\[\begin{align*} h_{1,i} &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= h_{0,i} \times p \\ p &\sim \text{Log Normal}(0, .25) \\ \sigma &\sim \text{Exponential}(1) \end{align*}\]
flist <- alist(
h1 ~ dnorm(mu, sigma),
mu <- h0*p,
p ~ dlnorm(0, .25),
sigma ~ dexp(1)
)
m3.3 <- quap(flist, d)
precis(m3.3)
## mean sd 5.5% 94.5%
## p 1.426628 0.01759792 1.398503 1.454753
## sigma 1.792063 0.12496047 1.592352 1.991774
Now add treatment to this model.
\[\begin{align*} h_{1,i} &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= h_{0,i} \times p \\ p &= \alpha + \beta_TT_i \\ \alpha &\sim \text{Log Normal}(0, .25) \\ \beta_T &\sim \text{Normal}(0, .5) \\ \sigma &\sim \text{Exponential}(1) \end{align*}\]
flist <- alist(
h1 ~ dnorm(mu, sigma),
mu <- h0*p,
p <- a + bT * treatment,
a ~ dlnorm(0, .25),
bT ~ dnorm(0, .5),
sigma ~ dexp(1)
)
m3.4 <- quap(flist, d)
precis(m3.4)
## mean sd 5.5% 94.5%
## a 1.38168552 0.02519767 1.34141478 1.4219563
## bT 0.08366426 0.03431331 0.02882496 0.1385036
## sigma 1.74629961 0.12190865 1.55146604 1.9411332
Now add in both treatment and fungus to this model.
\[\begin{align*} h_{1,i} &\sim \text{Normal}(\mu_i, \sigma) \\ \mu_i &= h_{0,i} \times p \\ p &= \alpha + \beta_TT_i + \beta_FF_i \\ \alpha &\sim \text{Log Normal}(0, .25) \\ \beta_T &\sim \text{Normal}(0, .5) \\ \beta_F &\sim \text{Normal}(0, .5) \\ \sigma &\sim \text{Exponential}(1) \end{align*}\]
flist <- alist(
h1 ~ dnorm(mu, sigma),
mu <- h0*p,
p <- a + bT * treatment + bF * fungus,
a ~ dlnorm(0, .25),
bT ~ dnorm(0, .5),
bF ~ dnorm(0, .5),
sigma ~ dexp(1)
)
m3.4 <- quap(flist, d)
precis(m3.4)
## mean sd 5.5% 94.5%
## a 1.482826896 0.02452192 1.44363614 1.52201765
## bT 0.001060225 0.02987668 -0.04668849 0.04880894
## bF -0.267912187 0.03654981 -0.32632584 -0.20949853
## sigma 1.408656542 0.09859148 1.25108831 1.56622477
dag3.4 <- dagitty( "dag{ X -> Z; Y -> Z }" )
coordinates(dag3.4) <- list( x=c(X=-1,Z=0,Y=1) , y=c(X=0,Z=1,Y=0) )
drawdag( dag3.4, cex = 3, lwd = 3 )
Stratifying on Z opens up the association between X and Y. We do not want to stratify on Z.
We’ll use the rethinking package to simulate data:
d <- sim_happiness(seed = 1990, N_years = 1000)
precis(d)
## mean sd 5.5% 94.5% histogram
## age 3.300000e+01 18.7688832 4.000000 62.000000 ▇▇▇▇▇▇▇▇▇▇▇▇▇
## married 2.946154e-01 0.4560451 0.000000 1.000000 ▇▁▁▁▁▁▁▁▁▃
## happiness 6.832142e-19 1.2144211 -1.789474 1.789474 ▇▅▇▅▅▇▅▇
d %>%
mutate(married = factor(married, labels = c("unmarried", "married"))) %>%
ggplot(aes( x = age, y = happiness)) +
geom_point( aes( color = married), size = 3) +
scale_color_manual("",
values = c("unmarried" = "lightgrey",
"married" = "#1c5253")) +
theme(legend.position = "top")
Filter out people who are younger than 18. Then fit two models:
(You may want to center or standardize age in some way.)
d2 <- d[d$age >= 18, ]
d2$A <- standardize(d2$age)
d2$mid <- d2$married + 1
precis(d2)
## mean sd 5.5% 94.5% histogram
## age 4.150000e+01 13.8606201 20.000000 63.000000 ▃▇▇▇▇▇▇▇▇▇
## married 3.989583e-01 0.4899394 0.000000 1.000000 ▇▁▁▁▁▁▁▁▁▅
## happiness 9.251859e-19 1.2145867 -1.789474 1.789474 ▇▅▇▅▅▇▅▇
## A -6.476301e-18 1.0000000 -1.551157 1.551157 ▃▇▇▇▇▇▇▃
## mid 1.398958e+00 0.4899394 1.000000 2.000000 ▇▁▁▁▁▁▁▁▁▅
flist <- alist(
happiness ~ dnorm( mu, sigma ),
mu <- a[mid] + bA*A,
a[mid] ~ dnorm( 0, 0.5),
bA ~ dnorm( 0, 0.25),
sigma ~ dexp(1)
)
m3.5 <- quap(flist, d2)
precis(m3.5, depth=2)
## mean sd 5.5% 94.5%
## a[1] -0.5905624 0.04197727 -0.6576502 -0.5234746
## a[2] 0.8866633 0.05191218 0.8036976 0.9696290
## bA -0.2136768 0.03329633 -0.2668908 -0.1604629
## sigma 0.9928218 0.02264183 0.9566358 1.0290078
flist <- alist(
happiness ~ dnorm( mu, sigma ),
mu <- a + bA*A,
a ~ dnorm( 0, 0.5),
bA ~ dnorm( 0, 0.25),
sigma ~ dexp(1)
)
m3.6 <- quap(flist, d2)
precis(m3.6, depth=2)
## mean sd 5.5% 94.5%
## a 7.215763e-07 0.03903552 -0.06238558 0.06238703
## bA -1.864706e-07 0.03870314 -0.06185527 0.06185490
## sigma 1.213174e+00 0.02766004 1.16896820 1.25738038
Causal DAGs make strong assumptions about unobserved confounders, but analyzing them can still provide valuable insights about where our models might be wrong through testing implied relationships.
Conditional independencies are key testable implications of a DAG, representing pairs of variables that should show no association after controlling for specific sets of other variables.
We can identify conditional independencies using the same path analysis techniques used for finding backdoor paths - examining all paths between two variables and determining if there exists a conditioning set that blocks all paths.
While manually deriving conditional independencies in large graphs is complex due to the many variable pairs and paths to consider, computational tools can efficiently perform these calculations.